For this project we will explore the use of tree methods to classify schools as Private or Public based on their features.
Let's start by getting the data, which is included in the ISLR library as the College data frame.
It is a data frame with 777 observations on 18 variables.
Call the ISLR library and check the head of College (a built-in data frame in ISLR; use data() to confirm this). Then assign College to a data frame called df.
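A minimal sketch of this step, assuming the ISLR package is already installed. It lists the data sets shipped with ISLR, previews College, and copies it into df.

```r
# Assumes ISLR is installed, e.g. via install.packages("ISLR")
library(ISLR)

# List the data sets bundled with ISLR; College should be among them
data(package = "ISLR")$results[, "Item"]

# Preview the first rows, then copy the data frame into df
head(College)
df <- College

# Quick sanity check on the dimensions
dim(df)
```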
Let's explore the data!
Create a scatterplot of Grad.Rate versus Room.Board, colored by the Private column.
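One way to draw this plot is sketched below. The text does not name a plotting library, so the use of ggplot2 here is an assumption; base graphics would work as well.

```r
library(ISLR)
library(ggplot2)  # assumption: plotting library not specified in the text

df <- College

# Grad.Rate on the y-axis versus Room.Board on the x-axis,
# with points colored by the Private factor
scatter <- ggplot(df, aes(x = Room.Board, y = Grad.Rate)) +
  geom_point(aes(color = Private), alpha = 0.6)

print(scatter)
```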
Create a histogram of full-time undergraduate students (F.Undergrad), colored by Private.
Create a histogram of Grad.Rate colored by Private. You should see something odd here.
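The two histograms can be sketched as follows, again assuming ggplot2 as the plotting library. The bin count of 50 is an arbitrary choice for illustration. Checking the range of Grad.Rate is one way to confirm what looks odd in the second plot.

```r
library(ISLR)
library(ggplot2)  # assumption: plotting library not specified in the text

df <- College

# Histogram of full-time undergraduates, filled by Private
hist_undergrad <- ggplot(df, aes(x = F.Undergrad)) +
  geom_histogram(aes(fill = Private), color = "black", bins = 50)

# Histogram of graduation rate, filled by Private
hist_grad <- ggplot(df, aes(x = Grad.Rate)) +
  geom_histogram(aes(fill = Private), color = "black", bins = 50)

print(hist_undergrad)
print(hist_grad)

# A rate is a percentage, so inspect the observed range
range(df$Grad.Rate)
```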
Which college has a graduation rate above 100%?
Change that college's graduation rate to 100%.
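A short sketch of both steps. It relies on the fact that the College data frame stores the school names as row names; the variable name `odd` is just an illustrative choice.

```r
library(ISLR)

df <- College

# Logical mask for rows with an impossible graduation rate
odd <- df$Grad.Rate > 100

# The college names are stored as row names
rownames(df)[odd]

# Cap the offending value(s) at 100
df$Grad.Rate[odd] <- 100

# Confirm the fix
max(df$Grad.Rate)
```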
Split your data into training and testing sets 70/30. Use the caTools library to do this.
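A sketch of the split using `sample.split()` from caTools, which returns a logical vector and keeps the class proportions of the column it is given. The seed value and the name `split_mask` are assumptions made for reproducibility and readability. The block reloads College so it runs on its own; in the project you would use your cleaned df.

```r
library(ISLR)
library(caTools)

df <- College

set.seed(101)  # arbitrary seed, chosen only for reproducibility

# TRUE for roughly 70% of rows, stratified on Private
split_mask <- sample.split(df$Private, SplitRatio = 0.70)

train <- subset(df, split_mask)
test  <- subset(df, !split_mask)

dim(train)
dim(test)
```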
Use the rpart library to build a decision tree to predict whether or not a school is Private. Remember to build your tree only on the training data.
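A minimal sketch of fitting the tree, assuming a 70/30 split made with caTools and an arbitrary seed. The formula `Private ~ .` uses every other column as a predictor, and `method = "class"` requests a classification tree.

```r
library(ISLR)
library(caTools)
library(rpart)

df <- College
set.seed(101)  # arbitrary seed
split_mask <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_mask)
test  <- subset(df, !split_mask)

# Fit a classification tree on the training data only
tree <- rpart(Private ~ ., method = "class", data = train)

print(tree)
```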
Use predict() to predict the Private label on the test data.
Check the head of the predicted values. You should notice that you actually get two columns, one class probability for each label.
Turn these two columns into a single column that matches the original Yes/No labels of the Private column.
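The prediction and conversion steps can be sketched as below. The 0.5 cutoff and the names `tree.preds` and `split_mask` are assumptions for illustration. Calling `predict()` with `type = "class"` would return the labels directly, but the two-step version mirrors the instructions.

```r
library(ISLR)
library(caTools)
library(rpart)

df <- College
set.seed(101)  # arbitrary seed
split_mask <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_mask)
test  <- subset(df, !split_mask)
tree <- rpart(Private ~ ., method = "class", data = train)

# Default prediction for a classification tree: a matrix of
# class probabilities with one column per label ("No", "Yes")
tree.preds <- predict(tree, test)
head(tree.preds)

# Collapse the two probability columns into one Yes/No label
tree.preds <- as.data.frame(tree.preds)
tree.preds$Private <- ifelse(tree.preds$Yes >= 0.5, "Yes", "No")
head(tree.preds)
```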
Now use table() to create a confusion matrix of your tree model.
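A sketch of the confusion matrix for the tree. The predicted labels are wrapped in `factor()` with both levels so that the table is always 2 x 2; the dimension names `Predicted` and `Actual` are illustrative.

```r
library(ISLR)
library(caTools)
library(rpart)

df <- College
set.seed(101)  # arbitrary seed
split_mask <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_mask)
test  <- subset(df, !split_mask)
tree <- rpart(Private ~ ., method = "class", data = train)

# Convert class probabilities into Yes/No labels
probs <- predict(tree, test)
pred_label <- factor(ifelse(probs[, "Yes"] >= 0.5, "Yes", "No"),
                     levels = c("No", "Yes"))

# Rows are predictions, columns are the true labels
conf_mat <- table(Predicted = pred_label, Actual = test$Private)
conf_mat
```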
Use the rpart.plot library and the prp() function to plot out your tree model.
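Plotting the fitted tree takes a single call, sketched here with the same assumed split and seed as a stand-in for your own training data.

```r
library(ISLR)
library(caTools)
library(rpart)
library(rpart.plot)

df <- College
set.seed(101)  # arbitrary seed
split_mask <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_mask)
tree <- rpart(Private ~ ., method = "class", data = train)

# Draw the tree with the default prp() layout
prp(tree)
```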
Now use randomForest() to build a model that predicts the Private class. Add importance=TRUE as a parameter in the model. (Use help(randomForest) to find out what this does.)
What was your model's confusion matrix on its own training set? Use model$confusion. (randomForest computes this matrix from out-of-bag predictions.)
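A sketch of fitting the forest and inspecting it, under the same assumed split and seed. The name `rf.model` is illustrative. With `importance = TRUE` the fitted object's `importance` matrix also carries permutation-based measures, and `confusion` includes a `class.error` column alongside the counts.

```r
library(ISLR)
library(caTools)
library(randomForest)

df <- College
set.seed(101)  # arbitrary seed
split_mask <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_mask)
test  <- subset(df, !split_mask)

# Fit the forest on the training data only
rf.model <- randomForest(Private ~ ., data = train, importance = TRUE)

# Confusion matrix on the training data (out-of-bag predictions)
rf.model$confusion

# Variable importance measures
head(rf.model$importance)
```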
Now use your random forest model to predict on your test set!
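Predicting with the forest can be sketched as follows. Unlike the rpart default, `predict()` on a classification forest returns the class labels directly as a factor, so no conversion step is needed. The split, seed, and variable names are assumptions.

```r
library(ISLR)
library(caTools)
library(randomForest)

df <- College
set.seed(101)  # arbitrary seed
split_mask <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_mask)
test  <- subset(df, !split_mask)
rf.model <- randomForest(Private ~ ., data = train, importance = TRUE)

# Predicted labels for the held-out test set
rf.preds <- predict(rf.model, test)

# Confusion matrix on the test set
table(Predicted = rf.preds, Actual = test$Private)
```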
It will likely have performed better than a single tree. How much better depends on whether you treat recall, precision, or accuracy as the most important measure of the model.
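To compare the two models on each measure, a small helper can compute them side by side. This is a sketch: `class_metrics` is a hypothetical helper written for this example, not part of any package, and it treats "Yes" as the positive class. The split and seed are the same assumptions as before, so the exact numbers will vary with your own split.

```r
library(ISLR)
library(caTools)
library(rpart)
library(randomForest)

df <- College
set.seed(101)  # arbitrary seed
split_mask <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, split_mask)
test  <- subset(df, !split_mask)

tree     <- rpart(Private ~ ., method = "class", data = train)
rf.model <- randomForest(Private ~ ., data = train, importance = TRUE)

# Hypothetical helper: accuracy, precision, and recall for one class
class_metrics <- function(predicted, actual, positive = "Yes") {
  predicted <- as.character(predicted)
  actual    <- as.character(actual)
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  c(accuracy  = mean(predicted == actual),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

tree_metrics <- class_metrics(predict(tree, test, type = "class"), test$Private)
rf_metrics   <- class_metrics(predict(rf.model, test), test$Private)

# One row per model, one column per measure
rbind(tree = tree_metrics, forest = rf_metrics)
```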